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ABSTRACT 

HPC environments have traditionally been designed to meet 
the compute demand of scientific applications and data has 
only been a second order concern. With science moving 
toward data-driven discoveries relying more and more on 
correlations in data to form scientific hypotheses, the limi¬ 
tations of existing HPC approaches become apparent: Ar¬ 
chitectural paradigms such as the separation of storage and 
compute are not optimal for I/O intensive workloads (e. g. 
for data preparation, transformation and SQL workloads). 
While there are many powerful computational and analyti¬ 
cal kernels and libraries available on HPC (e. g. for scalable 
linear algebra), they generally lack the usability and vari¬ 
ety of analytical libraries found in other environments (e. g. 
the Apache Hadoop ecosystem). Further, there is a lack 
of abstractions that unify access to increasingly heteroge¬ 
neous infrastructure (HPC, Hadoop, clouds) and allow rea¬ 
soning about performance trade-offs in these complex envi¬ 
ronments. At the same time, the Hadoop ecosystem is evolv¬ 
ing rapidly with new frameworks for data processing and 
has established itself as de-facto standard for data-intensive 
workloads in industry and is increasingly used to tackle sci¬ 
entific problems. In this paper, we explore paths to interop¬ 
erability between Hadoop and HPC, examine the differences 
and challenges, such as the different architectural paradigms 
and abstractions, and investigate ways to address them. We 
propose the extension of the Pilot-Abstraction to Hadoop to 
serve as interoperability layer for allocating and managing 
resources across different infrastructures providing a degree 
of unification in the concepts and implementation of resource 
management across HPC, Hadoop and other infrastructures. 
For this purpose, we integrate Hadoop compute and data re¬ 
sources (i. e. YARN and HDFS) with the Pilot-Abstraction. 

In-memory capabilities have been successfully deployed 
to enhance the performance of large-scale data analytics 
approaches (e. g. iterative machine learning algorithms) for 
which the ability to re-use data across iterations is critical. 
As memory naturally fits in with the Pilot concept of re¬ 


taining resources for a set of tasks, we propose the extension 
of the Pilot-Abstraction to in-memory resources. These en¬ 
hancements to the Pilot-Abstraction have been implemented 
in BigJob. Further, we validate the abstractions using ex¬ 
periments on cloud and HPC infrastructures investigating 
the performance of the Pilot-Data and Pilot-Hadoop imple¬ 
mentation, HDFS and Lustre for Hadoop MapReduce work¬ 
loads, and Pilot-Data Memory for KMeans clustering. Using 
Pilot-Hadoop we evaluate the performance of Stampede, a 
compute-centric resource, and Gordon, a resource designed 
for data-intensive workloads providing additional memory 
and flash storage. Our benchmarks of Pilot-Data Memory 
show a significant improvement compared to the file-based 
Pilot-Data for KMeans with a measured speedup of 212. 

1. INTRODUCTION 

As more scientific disciplines rely on data as an impor¬ 
tant means for scientific discovery, the demand for infras¬ 
tructures that support data-intensive tasks in addition to 
traditional compute-intensive tasks, such as modeling and 
simulations, is increasing. For example, in biology and as¬ 
tronomy scientific discovery is increasingly based on analysis 
of data collected from machines, such as genome sequencing 
machines or observatories 1 . Across disciplines there is a 
move towards data-driven discovery and with increasing di¬ 
versity in the source of data (c.f. the Internet of Things and 
the usage of networked sensors to collect data). The term 
“fourth paradigm” [2] refers to scientific discovery based on 
data in addition to theory, experimentation and simulation 
based discovery. 

Data-intensive applications are associated with a wide va¬ 
riety of characteristics and properties, as summarized by 
Fox et al. 3, [4] . Often, they are more complex and hetero¬ 
geneous than HPC applications as they typically comprise of 
multiple stages with different characteristics. Typical stages 
are: data ingest, pre-processing, feature-extraction and ad¬ 
vanced analytics. While some of these stages are I/O bound 
with potential different I/O characteristics (random vs. se¬ 
quential access), some stages (e. g. advanced analytics) are 
compute- and memory bound. Managing and supporting 
such heterogeneous application workflows on top of hetero¬ 
geneous resources at scale represents an important challenge. 

HPC infrastructures introduced parallel filesystems, such 
as Lustre or CPFS, to meet the increased I/O demands of 
data-intensive applications and archival storage to address 
the need for retaining large volumes of primary simulation 
output data. The parallel filesystem model of using large, 
optimized storage clusters exposing a POSIX compliant rich 


interface and connecting it to compute nodes via a fast in¬ 
terconnects works well for compute-bound task. It has how¬ 
ever, some limitations for data-intensive, I/O-bound work¬ 
loads that require a high sequential read/write performance. 
Similarly, high-throughput infrastructures (HTC), such as 
OSG, rely on separate storage environments, e. g. SRM or 
iRODS based, to manage data. In clouds, object stores are a 
common mechanism for persisting data outside of ephemeral 
VM-based compute resources. To address some of these is¬ 
sues, new - mostly hardware - capabilities have been added 
to HPC resources to meet the demands of data-intensive 
applications; machines such as Gordon or Wrangler, provide 
large memory nodes to facilitate shared memory data an¬ 
alytics tools (e. g. R, Python) and additional storage tiers 
(e.g. SSD). 

Hadoop [5] and the MapReduce abstraction 6 were de¬ 
veloped with data as first order consideration and estab¬ 
lished the the de-facto standard for data-intensive comput¬ 
ing. The biggest differentiator of Hadoop compared to HPC 
systems is data-locality: while HPC systems generally rely 
on fast interconnects between compute and storage, Hadoop 
co-locates compute and data. The Hadoop ecosystems - in 
the following referred to as Apache Big Data Stack (ABDS) 
- provides a manifold set of novel tools and higher-level ab¬ 
stractions for data processing. In addition to MapReduce, 
Spark [f gained popularity for memory-centric data process¬ 
ing and analytics. ABDS tools and frameworks are increas¬ 
ingly used in sciences (see [8j [9]). While the Hadoop plat¬ 
form has proven its value in scientific applications, challenges 
remain in deploying ABDS applications on HPC infrastruc¬ 
ture and when integrating these with HPC applications, e. g. 
based on MPI 10, II] . 

A main differentiator of ABDS are high-level abstractions 
that trade-off capabilities and performance; MapReduce and 
Spark’s RDDs offer the ability to reason about data process¬ 
ing steps (e. g. filtering, transformations and aggregations) 
without the need to explicitly implement data parallelism. 
Further important capabilities are offered in the advanced 
analytics and machine learning domain. MPI provides a 
good abstraction for implementing parallel applications pro¬ 
viding primitives for point-to-point and collective communi¬ 
cations; however, it lacks the productivity of higher-level 
ABDS abstractions. 

Over the past decades, the High Performance Distributed 
Computing (HPDC) community has made significant ad¬ 
vances in addressing resource and workload management on 
heterogeneous resources. In contrast, the ABDS ecosystem 
has been evolving to provide similar levels of sophistication 
for commercial enterprise. In fact, some conceptual ideas 
dominant in the HPDC world have already started mak¬ 
ing an impact in ABDS tools. For example, the concept of 
multi-level scheduling as manifest in the decoupling work¬ 
load assignment from resource management via the concept 
of intermediate container jobs (also referred to as Pilot- 
Jobs 12]) has made its presence in ABDS after the tran¬ 
sition from Hadoop to YARN/Mesos. This concept has 
been adapted and refined in the ABDS environment allow¬ 
ing frameworks to retain resources (cores, memory, stor¬ 
age) and expand/shrink the resource pool if necessary. It 
turns out to be the case that multi-level scheduling is even 
more important for data-intensive applications as often only 
application-level schedulers can be aware of the localities of 
the data sources used by a specific application. This mo¬ 


tivated the extension of the Pilot-Abstraction to support 
data-aware scheduling on application-level 13]. As most 
HPC schedulers are data agnostic, this is an important/- 
critical extension of capabilities. However, this advance has 
had the interesting consequence that heterogeneity in ABDS 
systems is now no longer confined to the filesystems and 
resource-access mechanisms, but like traditional HPDC sys¬ 
tems, the resource management interface and semantics are 
different. 

Collectively, the above features and trends point to the 
possibility and need for consilience between the HPDC and 
ABDS approaches for resource management. The aim of this 
paper is to examine the need for consilience in resource man¬ 
agement approaches between ABDS and traditional HPDC 
approaches, understand some of the challenges on the path 
to doing so and explore the Pilot-Abstraction as one possible 
way to address some of the challenges. 

Another concern of this paper is to understand how to ad¬ 
dress the issues of interoperability: There is a great need to 
integrate both HPDC and Hadoop, e. g. due to the necessity 
to co-locate data/compute or to combine compute-centric 
HPC applications (e. g. linear algebra solvers) with ABDS 
applications (e. g. MapReduce and Spark). In this paper, we 
explore the usage of the Pilot-Abstraction inside an ABDS 
environment, as well as a path for the interoperable use of 
ABDS applications on HPDC infrastructure. 

This paper makes the following contributions: (i) We 
propose several extensions to the Pilot-Abstraction [12] to 
better facilitate data processing and advanced analytics on 
HPDC and to support interoperability with Hadoop; specifi¬ 
cally, we design and implement Pilot-Abstractions that pro¬ 
vides a common approach for data-aware resource manage¬ 
ment on and across HPC, cloud and Hadoop infrastructures. 
By supporting Hadoop’s resource manager YARN, the Pilot- 
Abstraction can be used as standard application framework 
simplifying the usage of HPC application on Hadoop. (ii) 
We extend an implementation of the Pilot-Abstraction to fa¬ 
cilitate the deployment and execution of ABDS applications 
on HPC. Pilot-Hadoop enables the dynamic, ad-hoc creation 
of Hadoop or Spark clusters on HPC infrastructures, (iii) 
We extend Pilot-Data for distributed in-memory comput¬ 
ing that is essential for scalable analytics, such as iterative 
machine learning. Pilot-Data Memory provides a unified 
way to access distributed memory within data-intensive ap¬ 
plications that is integrated with the data-affinity model of 
Pilot-Data. Pilot-Data offers a unified approach for data 
management across complex storage hierarchies comprising 
of local disks, cloud storage, parallel filesystems, SSD and 
memory, (iv) Finally, we validate the proposed abstractions 
and tools on XSEDE using Stampede and Gordon (machine 
designed for data-intensive applications). We investigate the 
performance of HDFS and Lustre using a MapReduce work¬ 
load and of KMeans running on different ABDS and HPC- 
backends using the Pilot-Abstraction. 

This paper is structured as follows: In section[2]we survey 
the ABDS ecosystem and compare the provided tools and 
abstractions with the HPC environment. We continue with 
a discussion of the new Pilot-Data capabilities that support 
Hadoop/HPC interoperability as well as advanced analytics 
and machine learning applications in section [3] The results 
of our experimental validation are presented in section]!] We 
conclude with a discussion of the contributions and lessons 
learn as well as relevant future issues in section [5] 


2. BACKGROUND AND RELATED WORK File-based Abstractions 


Hadoop emerged separate from high-performance comput¬ 
ing in enterprise environments inspired by Google’s cloud 
infrastructure based on the Google Filesystem and MapRe¬ 
duce [6]. Apache Hadoop evolved to a general purpose clus¬ 
ter computing framework suited for many kinds of data- 
intensive applications 14]. While Hadoop MapReduce 
lacks behind some capabilities, e.g., high-performance inter¬ 
process communication, a set of novel, high-level abstrac¬ 
tions and runtime environments emerged on top of the core 
Hadoop services, the Hadoop Filesystem (HDFS) and the 
YARN resource manager. YARN allows the deployment of 
any application on a Hadoop cluster without the need to 
retrofit applications into the MapReduce model as required 
for Hadoop 1. 

In this section, we summarize our previous work in ana¬ 
lyzing the characteristics and properties HPC and Hadoop 
Infrastructures as well as related work to provide abstrac¬ 
tions and runtime environment for such applications. 


2.1 HPC and ABDS Abstractions 

This section describes abstractions and runtime systems 
for data-intensive computing in the HPC and ABDS do¬ 
mains. As alluded to earlier, data-intensive applications are 
typically more heterogeneous than compute-intensive sim¬ 
ulations. Data-intensive applications are often modeled as 
a pipeline of tasks (from data ingest, storage, processing to 
analytics) or a direct acyclic graph with tasks corresponding 
to nodes in the graph. We use the term workflow to refer to 
multi-stage data-processing; data pipeline is often used as a 
synonym for this. 

A common concern is the provision of a scalable environ¬ 
ment support different stages of data processing and ana¬ 
lytics workflows: from coarse-grained data parallelism for 
data-Hltering with MapReduce to fine-grained parallelism 
for machine learning that often relies on scalable linear al¬ 
gebra libraries. In this section, we give a brief overview of 
HPC libraries and abstractions and investigate their usage 
within data-intensive workflows. We further compare and 
contrast these approaches to abstractions developed in the 
ABDS world. 


Communication-Abstractions (MPI) 

Analytics and machine learning algorithms can often be ex¬ 
pressed and transformed into matrix operations and thus, 
require efficient linear algebra implementation. MPI and 
OpenMP are the standard abstraction for implementing 
parallel applications on HPC infrastructures. While they 
provide an important building block for scalable analyt¬ 
ics, they lack the usability of higher-level abstractions (such 
as the R dataframe). Several parallel, numerical libraries 
(MPI/OpenMP-based) that support analytics have been de¬ 
veloped, e. g. ScaLAPACK 15] or ARPACK 16]. These par¬ 
allel libraries are based on the lower-level libraries LAPACK 
and BLAS. While they provide an important building block 
for scalable analytics, they lack the usability of higher-level 
abstractions. Which the introduction of YARN, the exe¬ 
cution of MPI applications on Hadoop clusters is well sup¬ 
ported, e. g. MPICH2-YARN and Hamster, an YARN appli¬ 
cation master for OpenMPI. 


Scientific applications are commonly based on files. Many 
abstractions for improving file management in the HPC con¬ 
text emerged: Filecules 17 e. g. simplify the the process¬ 
ing of file groups. Similarly, other data management sys¬ 
tems, such as iRods often work on collection/groups of files. 
However, file-based data management is often associated 
with inefficiency in particular for temporary data, e. g. in¬ 
termediate data of iterative machine learning applications. 
iRods [18] utilizes so-called collections to group and manage 
files. While these tools allow a logical grouping of data, they 
provide limited support for data-parallel processing of these 
files as data partitioning and processing is done outside of 
these tools. 


Pilot-Jobs and Workflows 

Pilot-Jobs 12] have been developed to support ensembles 
of fine-grained tasks in HPC environments; examples of 
Pilot-Jobs are Condor-G/Glideln [19] or BigJob 20 . By 
using a set of placeholder jobs distributed across multiple 
resources, the system can accommodate dynamic task work¬ 
loads through a dynamically adjusting resource pool improv¬ 
ing the overall utilization at the same time. Pilot-Jobs have 
been successfully used to enhance the performance of scien¬ 
tific workflows. 

Scientific workflows popularized the direct acyclic graphs 
(DAGs) abstraction as fundamental way to express work- 
flows; examples of such systems are Pegasus [21] and Tav- 
erna [22]. Many of these system focus on the higher-level 
workflow abstraction lacking a high-performance, vertically 
integrated runtime system (as provided by the ABDS stack). 
Another constraint is the fact that the dataflow is often 
based on files. In ABDS similar DAG abstractions emerged, 
which are implemented on top of ABDS primitives, such as 
MapReduce, and are discussed in the following section. 

Pilot-Jobs have been extended to facilitate data manage¬ 
ment (mainly on file-basis) and for dataflow-oriented work- 
flows. Falcon [23] provides a data-aware scheduler on top 
of a pool of dynamically acquired compute and data re¬ 
sources [24j. The so called data diffusion mechanism can 
cache data on Pilot-level enabling the efficient re-use of 
data. Another area of research is the utilization of dis¬ 
tributed memory for data-intensive, task-based workflows. 
Swift/T [25 is a rewrite of the Swift to utilize the MPI- 
based Turbine engine for processing of data-intensive work- 
flows benefiting from MPI features, such as effective collec¬ 
tive communications. 

MapReduce & Higher-Level Abstractions 

MapReduce [f| proofed an effective abstraction for process¬ 
ing data in a parallel way decoupling storage backend, data 
formats and processing engine. Hadoop MapReduce estab¬ 
lished itself as de-facto standard for scalable data processing 
- in contrast to file-based approaches found in the scien¬ 
tific workflow community - MapReduce hides complex de¬ 
tails, such as data organization, formats, data partitioning 
and aggregation. While MapReduce simplified the creation 
of data-intensive application (particularly applications that 
need to process vast volumes of data), the MapReduce ab¬ 
straction is limited in its expressiveness as pointed out by 
various authors [26, [27 and lead to manifold higher-level 
abstractions for implementing sophisticated data pipelines. 

In addition, the native APIs provided by Hadoop are com- 



plex: the creation of a simple application requires an imple¬ 
mentation of a map and reduce function, as well as various 
configurations and auxiliary functions. Also, the creation of 
more complex data pipelines or iterative machine learning 
applications consisting of multiple MapReduce jobs is very 
complex. Thus, a set of high-level APIs, such as Apache 
Crunch 28 , Cascading [29], Apache Flink 30 and Spring 


XD [31], emerged. While the expressiveness of the API was 
better than MapReduce, they still were constrained by the 
MapReduce runtime system. 

With the emergence of YARN, several new processing en¬ 
gines emerged in the Hadoop ecosystems that improved the 
support for workflows. Framework, such as Spark 7 and 
Tez [32], provide richer abstractions that are built on modern 
processing engines that can retain resources across task gen¬ 
erations and effectively utilize distributed memory. Spark’s 
reliable distributed dataset (RDD) abstraction provides a 
powerful way to manipulate distributed collection stored in 
the memory of the cluster nodes. Spark is increasingly used 
for building complex data workflows and advanced analytic 
tools, such as MLLib [33] and SparkR [34] , 

As alluded, the ability to utilize efficient collectives is im¬ 
portant for advanced analytics implementations that often 
require fast linear algebra implementations. Some hybrid 
frameworks have been proposed to provide MPI-style collec¬ 
tive operations in conjunction with data-intensive abstrac¬ 
tions, such as MapReduce. For example, Harp [35] pro¬ 
poses the usage of collective operations in the map-phase of 
a MapReduce application - the model is referred to Map- 
Collective. This model is similar to the bulk-synchronous 
communication model or the MPI communication model; 
Harp however aims to provide a higher-level abstraction 
than MPI. In contrast to other abstraction, such as MapRe¬ 
duce, the user is required to manually manage data par¬ 
titions and processes. Different implementations and algo¬ 
rithms for the HARP collective layer have been investigated, 
e. g. based on based e.g. on the Netty and Azure inter-role 
communication mechanism. A constraint is that the col¬ 
lective framework of Harp currently only supports a static 
group of resources. 


Dataset and Dataframe Abstractions 

The dataset or dataframe abstractions originally introduced 
in R (respectively his predecessor S) exposes data in a tab¬ 
ulated format to the user and supports the efficient expres¬ 
sion of data transformations and analytics [36]. A dataframe 
typically stores data matrix of different types of data (e. g. 
numeric, text data and categorical data). The dataframe ab¬ 
straction supports various functions for manipulating data 
stored inside the data structures, e. g. to subset, merge, fil¬ 
ter and aggregate data, using well-defined primitives or SQL. 
Similar abstractions emerged for other languages and run¬ 
time environments, e. g. Pandas 37 , Scikit-Learn 38 and 


Dato SFrame [39 for Python. For Mahout a dataframe ab¬ 
straction has been proposed. 

While the dataframe abstractions are very expressiveness 
and well-suited for suited for implementing advanced ana¬ 
lytics, they traditionally lacked the ability to scale-out. The 
MLI-inspired Spark Pipeline API 40 for Spark, H20 [4l], 
and Blaze 02] for Python attempt to introduce similar ab¬ 
stractions also supporting scalable backends - however, they 
currently lack in the variety of analytics frameworks and al¬ 
gorithms available for these tools. Another constraint is that 


many of these framework, e.g. Spark, are Java-based mak¬ 
ing it difficult to integrate these with native libraries, e. g. 
parallel linear algebra libraries from the scientific comput¬ 
ing domain. HPC abstractions in contrast to the dataframe 
or MapReduce abstraction focus on low-level functions (e. g. 
MPI for communications). Thus, they often do not provide 
an easy way to explore data-parallelism. Further, higher- 
level dataset/dataframe abstractions designed for domain 
scientists are typically not available. 

The ability to combine different frameworks and abstrac¬ 
tions is the key for implementing end-to-end data-intensive 
workflows with multiple stages with different workload char¬ 
acteristics. To address these complex requirements, it can 
be expected, that the design space for abstractions will be 
further explored and more hybrid approaches will emerge 
allowing for efficiently supporting end-to-end data pipelines 
(including analytics). Another constraint is the fact, that 
the majority of the available runtime systems is based on 
Java. Python - a popular language for scientific computing 
- is typically only supported as secondary option. 

2.2 HPC and Hadoop 

The heterogeneity of distributed infrastructures is still in¬ 
creasing: HPC and Hadoop e. g. following dramatically dif¬ 
ferent design paradigms. In HPC environments traditionally 
storage and compute are separated connected by an high-end 
network (e. g. Inffniband). To address the increasing need to 
process large volumes of data on HPC, the capacity of these 
storage systems and the networks. System, such as Wran¬ 
gler 03], maintain this separation. Wrangler in particular 
deploys a mix of a Lustre storage system, a SSD-based high- 
end storage system and local storage. In addition archival 
storage systems often based on HPSS are used to store cold 
data. 

Resource Management: The fine-grained data-parallelism 
of data-intensive applications is ideally suited for multi¬ 
level scheduling. Multi-level scheduling 02] originated in 
HPC environments and proofed as efficient mechanism to 
managed ensembles of tasks. By decoupling system-level 
and application-level scheduling and by taking into account 
system-level, such as resources utilization and allocation 
policies, as well as application-level objectives, e. g. dy¬ 
namic resource requirements for every application stage, in 
most cases a superior performance can be achieved. Pilot- 
Jobs 12] provided a powerful abstraction for implement¬ 
ing multi-level scheduling on HPC systems. While Hadoop 
originally only provided a rudimentary scheduling system, 
the new YARN scheduler provides efficient support for 
application-level scheduling. 

YARN address the need that with the uptake of Hadoop, 
the requirements with respect to resource management in¬ 
creased: more complex data localities (memory, SSDs, disk, 
rack, datacenter), long-lived services, periodic jobs, interac¬ 
tive and batch jobs need to be supported on the same en¬ 
vironment. Multi-level scheduling serves as the basic archi¬ 
tectural principle to support this requirement. YARN 4b], 
Hadoop’s resource, aims to address these limitations. I 11 con¬ 
trast, to traditional batch schedulers, YARN is optimized for 
data-intensive environments supporting data-locality and 
the management of a large number of fine-granular tasks 
(found in data-parallel applications). YARN enables ap¬ 
plications to deploy their own application-level schedul¬ 
ing routines on top of Hadoop-managed storage and com- 





pute resources. While YARN manages the lower resources, 
the higher-level runtimes typically use an application-level 
scheduler to optimize resource usage for the application. 
Applications need to initialize their so-called Application- 
Master via YARN; the Application Master is then respon¬ 
sible for allocating resources - containers - for the applica¬ 
tions and to execute tasks in these containers. Data locality, 
e. g. between HDFS blocks and container location need to 
manually managed by the application master (by request¬ 
ing containers on specific nodes/racks etc.). Other resource 
management systems addressing similar needs emerged, e. g. 
Mesos 46 Omega 4T Google’s distributed scheduler. 


Managing resources on top of YARNs is associated with 
several challenge: while the default design particularly fa¬ 
cilitates the fine-grained data parallelism of MapReduce, for 
certain application characteristics it is desirable to retain re¬ 
sources during longer period, e. g. to longer cache data that is 
often re-used in-memory, to have readily available resources 
for interactive applications or to facilitate iterative process¬ 
ing. Several higher-level frameworks for YARN addressing 
specific application characteristics emerged: Llama [48] of¬ 
fers a long-running application master for YARN designed 
for the Impala SQL engine. Apache Slider [49 supports 
long-running distributed application on YARN with dy¬ 
namic resource needs allowing applications to scale to addi¬ 
tional containers on demand. TEZ [32] is a DAG processing 
engine primarily designed to support the Hive SQL engine 
allowing the application to hold containers across multiple 
phases of the DAG execution without the need to de-/re- 
allocate resources. REEF [50] is a similar runtime environ¬ 
ment that provides applications a higher-level abstractions 
to YARN resources allowing it to retain memory and cores 
supporting heterogeneous workloads. REEF Retainers can 
re-use JVM and store data in the JVM’s memory. 

Interoperability: To achieve interoperability, several frame¬ 
works explore the usage of Hadoop on HPC resources. Re¬ 
source managers, such as Condor and SLURM, provide 
Hadoop support. Further, various third-party systems, such 
as SAGA-Hadoop [51 , JUMMP [52] or MyHadoop 53 , ex¬ 


ist. A main disadvantage with this approach is the loss of 
data-locality, which the system-level scheduler is typically 
not aware of. In addition there are several other limitations 
associate with that approach: e. g. the necessity to load data 
into HDFS and the ability to achieve higher cluster utiliza¬ 
tion by more fine-grained resource sharing. 

For Hadoop deployments typically local storage is pre¬ 
ferred; nevertheless some deployments use a HPC-style sep¬ 
aration of compute and storage systems. Hadoop workloads 
on these HPC systems is supported via a special client li¬ 
brary, which improves the interoperability with Hadoop; it 
limits however data locality and the ability for the applica¬ 
tion to optimize for data placements since applications are 
commonly not aware of the complex storage hierarchy. An¬ 
other interesting opportunity is the usage of Hadoop as ac¬ 
tive archival storage - in particular, the newly added HDFS 
heterogeneous storage support is suitable for supporting this 


Summary: Understanding performance in a heteroge¬ 
neous, distributed environments is complex. In the remain¬ 
der of the paper, we investigate the Pilot-Abstraction as 
unifying concept to efficiently support the interoperability 



Figure 1: Hadoop and HPC Interoperability: There 
are two options: a. Running HPC applications in¬ 
side a YARN cluster, b. spawning a YARN or Spark 
cluster on a HPC environment. 


between HPC and Hadoop. By utilizing the multi-level 
scheduling capabilities of YARN, Pilot-Data can efficiently 
manage Hadoop cluster resources providing the application 
with the necessary means to reason about data and com¬ 
pute resources and allocation. On the other side, we show, 
how the Pilot-Abstraction can be used to manage ABDS 
application on HPC environments. 


PILOT-ABSTRACTION FOR HADOOP: 
DESIGN AND IMPLEMENTATION 


The Pilot-Abstraction 


12 has been successfully used in 


HPDC for supporting a diverse set of task-based workloads 
on distributed resources. A Pilot-Job provides the ability 
to utilize a placeholder job as a container for a dynam¬ 
ically determined set of compute tasks. The Pilot-Data 
abstraction 13 extends the Pilot-Abstraction for support¬ 


ing the management of data in conjunction with compute 
tasks. The Pilot-Abstraction defines the following entities: 
A Pilot-Compute allocates a set of computational resources 
(e. g. cores); a Pilot-Data represents space on a physical stor¬ 
age resource. Further, the abstraction defines a Compute- 
Unit (CU) as a self-contained piece of work represented as 
executable that is submitted to the Pilot-Job. A Data-Unit 
(DU) represents a self-contained, related set of data. 

The Pilot-Job and Pilot-Data abstractions have been im¬ 
plemented within BigJob 12] [20], an interoperable Pilot- 
Abstraction framework for supporting heterogeneous tasks- 
based workloads on heterogeneous infrastructures. In this 
section, we present the extensions made to the Pilot- 
Abstraction to facilitate a broader set of data-intensive ap¬ 
plications and infrastructures, such as Hadoop. We explore 
several options for integrating HPC and Hadoop environ¬ 
ments as depicted in Figure [T] using the Pilot-Abstraction. 
By providing support for Hadoop inside BigJob (Figure [l^,) 
HPC applications can run on Hadoop YARN clusters with¬ 
out modification. Pilot-Hadoop enables the execution of 
ABDS applications inside environments managed by tradi¬ 
tional HPC schedulers (Figure [lja). 

In section [371] we describe how the Pilot-Abstraction and 
the BigJob implementation was extended to support Hadoop 
resource manager, i. e. YARN and Mesos, and HDFS stor¬ 
age. We continue with a discussion of Pilot-Hadoop for run¬ 
ning YARN as application-level scheduler on HPC resources 































in section |3.2| Finally, we present Pilot-Data Memory- an 
infrastructure-agnostic in-memory runtime for analytics ap¬ 
plications in section [33] 

3.1 Pilot-Abstraction: Interoperable Access 
and Management of Hadoop Resources 

With the introduction of YARN, arbitrary applications 
can be executed within Hadoop clusters. Nevertheless, uti¬ 
lizing an Hadoop environment outside of higher-level frame¬ 
works, such as MapReduce and Spark, is a difficult task. 
Established abstractions that enable the user to reason 
about compute and data resources across infrastructures 
(i. e. Hadoop, HPC and clouds) are missing. Also, the new 
generation of schedulers that emerged in the YARN space 
impose a more stringent requirements on the application. 
While schedulers such as YARN or Mesos effectively facili¬ 
tate application-level scheduling, the development efforts for 
YARN and Mesos applications are very high. YARN pro¬ 
vides e. g. only a very low-level abstraction for resource man¬ 
agement - the application must be able to work with a subset 
of the requested resources. Also, allocated resources (the so 
called YARN containers) can be preempted by the scheduler. 
Data/compute locality needs to be manually managed by 
the application scheduler by requesting resources at the lo¬ 
cation of an file chunk. To address this, various frameworks 
that aid the development of such applications have been pro¬ 
posed, e.g. Apache Slider [49 and Spring YARN [54] . While 
these frameworks simplify development, they do not address 
concerns such as interoperability and support for geograph¬ 
ically distributed resources. 

To provide a unified, infrastructure-agnostic abstraction, 
the Hadoop resource model as exposed by YARN and HDFS 
must be mapped to the Pilot-Abstraction. The Hadoop re¬ 
source model primarily relies on cores and memory for mod¬ 
eling compute and storage space modeling data resources. 
While HPC resources typically only allocate compute cores 
and nodes, the memory requirement can be easily trans¬ 
lated to a Pilot-Compute description of a YARN-based Pi¬ 
lot. Similarly HDFS space can be mapped to the space 
parameter of the Pilot-Data description. 

To facilitate physical access to Hadoop resources, several 
adaptors had to be developed inside the Pilot-Framework 
BigJob. As shown in Figure [2] the Pilot-Framework is able 
to spawn and manage Compute-Units and Data-Units on 
different kinds of compute and storage resources using a set 
of adaptors. The new YARN, Mesos and HDFS adaptors en¬ 
ables applications to take advantage of new infrastructures 
and manage their CUs and DUs on dynamic resource pools 
located in an Hadoop environment. A particular challenge 
for the implementation of the YARN adaptor is the multi- 
step resource allocation process imposed by YARN, which 
as alluded differs significantly from HPC schedulers. The 
YARN adaptor for BigJob implements a so-called YARN 
Application Master, which is the central instance for man¬ 
aging the resource demands of the application. Once the 
Application Master is started, subsequent resource requests 
are handled by it. The BigJob Application Master will then 
request the specified number of YARN container. Once these 
are allocated by YARN, the Pilot-Agent will be started in¬ 
side these containers. The dispatching of the CUs is done 
via the normal Pilot-Framework internal mechanisms with¬ 
out involvement of YARN. 

The HDFS adaptor access the Hadoop Filesystem using 
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Figure 2: Pilot-Framework Big Job on Heteroge¬ 
neous Infrastructure: BigJob provide a unified ab¬ 
straction for compute and data resources. The ex¬ 
tension for Hadoop are encapsulated in a YARN 
adaptor for Pilot-Compute and HDFS adaptor for 
Pilot-Data. 
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Figure 3: Pilot-Abstraction for Data-Parallel Pro¬ 
cessing: Pilot-Compute allows the flexible alloca¬ 
tion of resources across heterogeneous infrastruc¬ 
tures, Pilot-Data enables Compute Units to seam¬ 
less access datasets on different storage infrastruc¬ 
tures. The In-Memory Data-Unit facilitates data- 
parallel processing using a group of Compute-Units. 


the WebHDFS API. Pilot-Data supports access to data from 
different sources: data may reside on a local or mounted 
shared storage system (e. g. Lustre storage), HDFS, iRods or 
another cloud object store. Pilot-Data will ensure that the 
data will be available before the Compute-Unit is started. 
As shown in Figure [3] the Pilot-Abstraction enables the 
implementation of complex data workflows, i. e. the stage- 
in/out and processing of data residing in different sources. 
In sect ion [373] we discuss the usage of Pilot-Data for caching 
data during complex processing steps. Pilot-Data is data 
format agnostic, i. e. the implementation of access to the 
data structure is done on application-level (in the CU). It 
supports standard formats, such as text files, CSV but also 
advanced columnar formats (bcolz) or HDF5. 

The Pilot-API provides a unified API across heteroge¬ 
neous resources and gives application-level control to stor¬ 
age and compute resources. The new adaptors enable appli¬ 
cations to seamlessly utilize Hadoop resources inside their 
data-intensive workflows. Depending on the application 
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Figure 4: Pilot-Hadoop for HPC and Cloud Infras¬ 
tructures: Pilot-Hadoop provides uniform frame¬ 
work to managing Hadoop and Spark clusters on re¬ 
sources managed by HPC schedulers, such as PBS, 
SGE and SLURM. The Pilot-API provides a uni¬ 
fying interface for allocating a YARN/Spark cluster 
on HPC and for managing Compute-Units that mar¬ 
shal a YARN, Spark or other ABDS application. 


requirements the API is suited for implementing complex 
data-intensive workflows as well as running scalable ana¬ 
lytics kernels (optimized for complex storage and memory 
hierarchies) on the data. The API relies on affinity labels 
to manage the co-location of data and compute (see |13|). 
Using the API developers can model complex storage hier¬ 
archies consisting of archival storage, cold data storage for 
raw data, warm and hot storage for pre-processed data used 
in the model fitting phase and memory for intermediate re¬ 
sults of iterative machine learning algorithms. The runtime 
system will ensure that data/compute will be co-located if 
possible to improve performance. 

3.2 Pilot-Hadoop: Supporting Application-Level 
Interoperability for ABDS and HPC 

Having discussed the extensions to the Pilot-Abstraction 
to support Hadoop resources, we explore the usage of 
Hadoop and other ABDS frameworks on HPC resources in 
this section (see Figure [T| (b)). Pilot-Hadoop [55] provides 
a framework for executing ABDS applications written for 
YARN (e. g. MapReduce) or Spark on HPC and cloud re¬ 
sources. Pilot-Hadoop has been developed as successor to 
SAGA-Hadoop [5l] utilizing the handling of multi-node clus¬ 
ter environments of the Pilot-Agent. 

Figure [4] illustrates the architecture of Pilot-Hadoop. 
Pilot-Hadoop uses the Pilot-Abstraction and the BigJob 
implementation to manage Hadoop clusters inside an en¬ 
vironment managed by an HPC scheduler, such as PBS, 
SLURM or SGE, or clouds. The Pilot-Framework is used 
for dispatching a bootstrap process that generates the nec¬ 
essary configuration files and for starting the Hadoop pro¬ 
cesses. The specifics of the Hadoop framework (i. e. YARN 
and Spark) are encapsulated in an adaptor. The boot¬ 
strap process is then responsible for launching YARN’S re¬ 
source and node manager processes respectively the Spark 


master and worker agents on the nodes allocated by the 
Pilot-Framework. While nearly all ABDS frameworks (e. g. 
MapReduce, Tez and also Spark) support YARN for re¬ 
source management, Spark provides a standalone cluster 
mode, which is more efficient for dedicated resources. Thus, 
a special adaptor for Spark is provided. 

Once the cluster is setup, users can to submit applications 
by using the Pilot-API’s Compute-Unit API to start and 
manage application processes. Compute-Units with type 
Hadoop and Spark are then forwarded to the YARN re¬ 
spectively Spark resource manager, which then handles the 
management of these tasks. With this capability, the Pilot- 
Abstraction can be used to manage highly heterogeneous 
workloads, e. g. bag-of-tasks, coupled tasks, MPI, Hadoop 
and Spark applications, via a single interface. 

3.3 Pilot-Data Memory: A Processing Engine 
for Machine Learning 

The Pilot-Abstraction provides a low-level mechanisms to 
manage the data across different, possible distributed, data 
stores in conjunction with their task-based computing fo¬ 
cusing on the stage-in and out of data related to a set of 
CUs. Also, it required the developer to either manually 
implement data parallelism or use a higher-level framework, 
such as Pilot-MapReduce [56]. Another constraint is the fact 
that only persistent storage can be used. The usage of (dis¬ 
tributed) memory for caching of input or intermediate data 
(e. g. for iterative machine learning) is not supported. While 
this disk-based model is effective for doing many forms of 
large volume data processing, iterative processing, e. g. for 
machine learning, requires more sophisticated ways to man¬ 
age intermediate data. 

Pilot-Data Memory adds in-memory capabilities to Pilot- 
Data and makes it available via the Pilot-API. A particular 
challenge is the integration of the in-memory layer with the 
compute layer, which typically requires support for a man¬ 
ifold set of tools and programming languages. We focus 
on providing a Python-based API and runtime; many scien¬ 
tific applications are implemented in Python, which makes it 
easy to integrate native code to achieve a good performance. 
Further, there is a lack of in-memory frameworks and tools 
for Python - Pilot-Data Memory is an attempt to address 
these limitations. A particular gap is the ability to manage 
large amounts of distributed memory - while it is fairly sim¬ 
ple to manage memory on a single node (using e. g. memory 
mapped files), support for distributed memory typically re¬ 
quires specialized runtime and processing environments that 
are not compatible with traditional Posix file APIs. 

To process data in an in-memory DU, we extend the DU 
interface to provide a higher-level MapReduce-based API for 
expressing transformations on the data. Using a map and re¬ 
duce functions, applications can express abstract operations 
on data without manually creating CUs for partitioning and 
processing the data. The API utilizes a key/value pair tu¬ 
ples as input for the map and reduce function. The runtime 
system generates the necessary application tasks (Compute- 
Units) and run these in parallel considering data locality 
and other aspects. Users have the possibility to control this 
placement using a simple, label-based affinity model, which 
allows reasoning about compute and data and provides the 
runtime system with hints for optimizing execution. The 
system is data format agnostic and supports heterogeneous 
data schemes (schema on read). 
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Figure 5: Pilot-Data Memory Architecture: The 
Pilot-Manager handles manages a set of queues from 
which the agents are pulling CUs. Data manage¬ 
ment is carried out at the Pilot-Agent-level. The 
Pilot-Agent will stage-in and out data via the Data 
Manager. The Distributed Memory Manager han¬ 
dles the caching of data. CUs are executed via the 
Compute Manager. 


Figure [5] illustrates the architecture of the Pilot-Data 
Memory framework. There are multiple levels of coordi¬ 
nation and decision making: The Compute-Data-Manager 
manages a set of memory and disk-backed Pilot-Data and 
Pilot-Compute. Applications can submit Compute-Units 
(CUs) and Data-Units (DUs) to the Pilot-Manager, which 
exposes control over Pilots, CUs and DUs via the Pilot- 
API. The Compute-Data-Manager will assign submitted 
Compute-Units and Data-Units to a Pilot taking into ac¬ 
count the current available Pilots, their utilization and data 
locality. The Pilot-Agent will stage-in and out data via the 
data manager. The Distributed Memory Manager handles 
the caching of data required for the computation. CUs are 
executed via the Compute Manager. The architecture en¬ 
ables late decision making at runtime: depending on the 
current utilization CUs can be processed by different Pilots. 
Currently, the Pilot-Manager considers both the utilization 
of the Pilot and data locality. In the future, we plan to sup¬ 
port further resource characteristics, such as the amount of 
available memory (critical for memory centric computing), 
as well as the characteristics of the CU workload (providing 
support for co-placing CUs or streaming data between two 
subsequent CUs). 

An important design objective for Pilot-Data Memory is 
extensibility and flexibility. Thus, Pilot-Data provides an 
adaptor mechanism to support different in-memory back¬ 
ends. Currently three different backends are supported: (i) 
file-based, (ii) in-memory Redis and (iii) in-memory Spark. 
Pilot-Hadoop can be used to setup the necessary Spark in¬ 
frastructure on a HPC resource. We further evaluate sup¬ 
port for Tachyon 57] and the HDFS In-Memory storage 
tier 58 . The adaptor service interface specifies the capabili¬ 
ties that need to be implemented by the in-memory backend; 
it consists of functions for allocating/deallocation memory, 
for loading data and for executing a map and reduce func- 
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Figure 6: Pilot-Data: Interoperable Use of Pilot- 
Data on HPC, Cloud and Hadoop Resources: Pilot- 
Data supports different compute infrastructure us¬ 
ing an adaptor mechanism and SAGA (left facet). 
Further, it supports running ABDS frameworks, 
such as Hadoop’s YARN and Spark, on HPC and 
Cloud environments (right facet). 


tions on the data. Depending on the backend the processing 
function need to be implemented either manually, e. g. for 
the file-based and Redis backend adaptor, or can be directly 
delegated to the processing engine as for Spark. The Redis 
and file backends use the Pilot-Job framework for executing 
the CUs generated by Pilot-Data Memory. 

In summary, the Pilot-Abstraction and Pilot-Data Mem¬ 
ory allows applications to seamlessly move data between dif¬ 
ferent forms of storage and memory providing the basis for 
the implementation of complex data workflows, e. g. for fus¬ 
ing different data sources, data filtering, feature extraction 
and for execution complex analytics on top of the data. As 
described in the following section (section [4|, Pilot-Data 
Memory enables the efficient implementation of advanced 
analytics algorithms allowing e. g. the efficient storage of in¬ 
termediate data in memory for iterative processing. 

4. EXPERIMENTS 

We proposed several extensions to the Pilot-Abstraction 
to accommodate Hadoop infrastructures. In this section, 
we investigate the characteristics of the Hadoop adaptor for 
the Pilot-Framework BigJob and Pilot-Hadoop. Further, we 
evaluate the performance of Hadoop MapReduce on HPC 
resources and compare HDFS and Lustre. Finally, we use 
KMeans to validate the suitability of Pilot-Data Memory for 
data analytics applications. 

4.1 Application Framework and Backend In¬ 
teroperability 

In the following, we investigate the performance of the 
new Pilot-Data adaptors for YARN and Mesos as well as 
Pilot-Hadoop. For this experiments, we utilize Mesos 0.14, 
Hadoop 2.6 and Spark 1.1 as well as Amazon EC2 and the 
XSEDE machine Stampede. 

Figure [6] summarizes the results. The left part illustrates 
the usage of the native Pilot-Data to manage CUs on HPC 
(Stampede) and on YARN and Mesos clusters (in this case 
hosted on Amazon Web Services). For short-running jobs 
the scheduler often represents a bottleneck, e. g. the startup 
of a YARN application typically requires several seconds 






















































mainly to due the overhead induced by the JVM startup 
as well as the complex startup process; resources have to be 
requested in two stages: first the application master con¬ 
tainer is allocated followed by the containers for the actual 
compute tasks. 

For the Pilot-Hadoop scenarios we utilize the Stampede 
supercomputer. In addition to the normal overhead for 
starting the Pilot-Agent (see left facet of Figure [6]), some 
extra time is needed for setting up a YARN respectively 
Spark cluster on Stampede. In both cases, we will spawn 
the YARN and Spark daemons without HDFS assuming 
that data will be read from the Lustre storage cluster. Both 
YARN and Spark show a comparable startup time. 

4.2 Exploring Hadoop on HPC 

Pilot-Hadoop enables users to start YARN and Spark clus¬ 
ters on HPC resources managed by a scheduler, such as PBS, 
SLURM or SGE. While the mechanics of launching Hadoop 
on HPC resources are well understood, a challenge remains 
the configuration of Hadoop in an optimal way taking into 
account the specifics of the resource, such as the available 
memory, storage (flash vs. disks) etc., as these can vary even 
within a single infrastructure, such as XSEDE. 

The objective of this experiment is to explore the usage of 
Hadoop on two XSEDE resources: Stampede and Gordon. 
On Stampede the storage space is partitioned into home, 
work, scratch and archival storage. The home, work and 
scratch directories are located on a Lustre filesystem; they 
differ with respect to their quota, backup and purge pol¬ 
icy. The archival storage is located on a remote system and 
not directly mounted on the compute nodes. The amount 
of local space is constrained to 80 GB in the /tmp directory. 
Some resources started to cater more data-intensive work¬ 
loads. Gordon e. g. offers 280 GB flash-based local storage 
per node. However, since the space is transient, this space is 
mainly suitable for intermediate data - otherwise data needs 
to be initially copied to this space. In addition to different 
storage option, we investigate the usage of the new HDFS 
In-Memory feature [58] . 

In the first step we analyze the HDFS and in particular the 
HDFS in-memory performance. For this purpose, we deploy 
an HDFS cluster on Stampede using Pilot-Hadoop using up 
to 512 cores and 32 nodes. HDFS is configured to use the 
local filesystem (/tmp) as data directory. Half of the data 
nodes memory is reserved for the in-memory cache. Further, 
we compare the HDFS performance with the Lustre storage 
available on Stampede. Figure [7] summarizes the results of 
our experiments on Stampede. 

HDFS 2.6 provides storage policies, which enable clients 
to directly store data in the memory of the data node with¬ 
out the need to wait for the persistence and replication of 
the data. This feature requires that the native libraries for 
Hadoop are in place and that the memory parameter for 
HDFS data node (datanode.max.locked.memory). We see 
a consistent minor improvement of the write performance 
when using the in-memory option. However, the overall 
write performance is determined by the non-parallel write 
to HDFS. We investigate two kinds of read performances: 
(i) the read performance using a single client that executes 
a get command and (ii) the parallel read using a MapReduce 
job. The larger the cluster size, the better the parallelism 
for MapReduce parallel reads - as expected Hadoop scales 
near linear in this case. For the in-memory case, we see no 
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Figure 7: HDFS and Lustre Performance on Stam¬ 
pede: The figure shows the time to read/write dif¬ 
ferent data sizes data to/from HDFS and Lustre. 
Lustre performs well for small data sizes, while 
HDFS has a better performance for MapReduce- 
based parallel reads of larger data sizes on larger 
clusters. 

performance improvements for normal reads (case (i)); for 
(ii) we see a minor benefits. 

Further, we investigate the performance of Lustre. In 
particular for small file sizes, Lustre performs well. Sur¬ 
prisingly, also the read performance using a non parallel- 
I/O client was in many cases lower than in the non-parallel 
HDFS client (case (i)). Obviously, for MapReduce work¬ 
loads HDFS clearly outperforms Lustre. MapReduce utilizes 
data locality when assigned map tasks to data file chunk. 
While there is a Hadoop Lustre plugin as part of the In¬ 
tel Enterprise Lustre edition available that utilizes a similar 
mechanism, it is currently not available on Stampede. Thus, 
by default when running Hadoop MapReduce on top of Lus¬ 
tre data-locality is not considered. 

Another concern is the performance of different environ¬ 
ments; in the following we compare the performance of 
Stampede and Gordon. Figure [8] shows the performance 
of both environments. Gordon clearly shows a better per¬ 
formance for HDFS mainly due to the local flash storage 
and more memory in the machines. The performance im¬ 
provement for the in-memory option is in average 9 % in 
comparison to Stampede where the speedup for in-memory 
is in average 14%. 

HDFS heterogeneous storage supports provides a uniform 
interface for reasoning about storage across a single names¬ 
pace, which allows a simplification of existing data flows. 
However, not all YARN/HDFS based tools optimally uti¬ 
lize these capabilities. MapReduce e.g. does not utilize the 
in-memory HDFS features. In the future, it can be expected 
that frameworks such as Spark will make use of these capa¬ 
bilities. 

4.3 Advanced Analytics: KMeans 

KMeans is a classical example of an advanced analytics 
algorithms used for clustering data that shares character¬ 
istics with a broad set of other analytics algorithms. The 
algorithms requires multiple iterations on the data - in each 
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Figure 8: Gordon vs. Stampede on 8 Nodes/128 
Cores: The HDFS performance on Gordon is sig¬ 
nificantly better mainly due to the usage of local 
flash drives compared to disks. The speedup be¬ 
tween flash storage and memory on Gordon is as 
expected lower than between disk and memory on 
Stampede. 
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Figure 9: Pilot-Data Memory KMeans: We compare 
three Pilot-Data backends: (i) file-based and two in- 
memory-based: (ii) Redis and (iii) Spark. The Redis 
and Spark scenario are both executed on Stampede. 
KMeans on Pilot-Data/Spark has the best perfor¬ 
mance with a speedup of up to 212 compared to the 
Pilot-Data/File on Stampede. Also the scale-out ef¬ 
ficiency of Spark is in most cases better than for 
Redis. 


iterations a new candidate location for the cluster centers. 

In the following, we use our Pilot-Data-based KMeans im¬ 
plementation to evaluate different Pilot-Data Memory back¬ 
ends. The computational kernel, which is called iteratively 
takes two parameters: (i) a set of multi-dimensional set of 
vectors to be clustered, remains constant through all the it¬ 
erations, and (ii) a set of centroid vectors. The centroids vec¬ 
tor changes each iteration. For the Pilot-based Redis back¬ 
end, we utilize one Pilot-Compute managing up to 384 cores 
on Stampede for running CUs on the Data-Unit stored in a 
Pilot-Data Memory. The file backend is included for refer¬ 
ence purposes. For the Spark scenario, we utilize Pilot-Spark 
to setup a Spark cluster on Stampede - an XSEDE leader¬ 
ship machine, which is then used to manage both the data 
and compute inside of Spark (via the Pilot-API). We use 
Redis 2.8 and Spark 1.1. The experiment demonstrates in¬ 
teroperability in two dimensions: first, it shows how ABDS 
frameworks can be run on ffPC infrastructures, second it 
demonstrates the versatility of the Pilot-Abstraction for im¬ 
plementing data analytics algorithms on top of different in¬ 
memory runtimes in an infrastructure agnostic way. 

Figure [9] shows the results of the experiments. We investi¬ 
gated three different K-Means scenarios: (i) 1,000,000 points 
and 50 clusters, (ii) 100,000 points and 500 clusters and (iii) 
10,000 points and 5,000 clusters. Each K-Means iteration 
comprises of two phases that naturally map to the MapRe¬ 
duce programming model of Pilot-Data Memory: in the map 
phase the closest centroid for each point is computed; in the 
reduce phase the new centroids are computed as the aver¬ 
age of all points assigned to this centroid. While the com¬ 
putational complexity is defined by the number of points 
x number of clusters (and thereby a constant in the afore¬ 
mentioned scenarios), the amount of data that needs to be 
exchanged during the shuffle phase increases gradually from 
scenario (i) to (iii), with the number of points. 

The performance of Pilot-KMeans improves significantly 
using the memory-based runtimes for the both the data 
points and intermediate cluster locations. The Redis back¬ 


end achieves only a speedup of up to 11, which is significantly 
lower that the speedup of up to 212 achieved with Spark. 
Also, the scale-out efficiency for Spark is better than for Re¬ 
dis in most cases. This is mainly caused by the fact that we 
utilize a non-distributed Redis server. In the future, we will 
evaluate a Redis cluster setup. Both in-memory backends 
scale more efficient than the file-backend. 

5. CONCLUSION AND FUTURE WORK 

As the needs and sophistication of scientific applica¬ 
tions increase, the infrastructures to support these appli¬ 
cations are becoming increasingly heterogeneous and try¬ 
ing to accommodate an increasing number and diversity of 
different workloads. In response, different infrastructures 
have emerged: HPC for tightly-coupled applications, HTC 
for task-level parallelism, clouds for elastic workloads and 
Hadoop for data-intensive workloads. The deluge of tools, 
abstractions and infrastructures lead to complex landscape 
of point solutions characterized by a tight coupling of the 
components and limited interoperability. 

In some ways HPC and Hadoop environments are converg¬ 
ing: increasingly parallel and in-memory computing con¬ 
cepts have emerged in the ABDS environment, e. g. ML- 
Lib/Spark utilizes fine-grained parallelism for implementing 
linear algebra operations for advanced analytics. Although 
the introduction of ABDS concepts and frameworks have 
begun, their uptake remains stymied by multiple reasons, 
one of which is related to finding satisfactory and scalable 
resource management techniques usable for ABDS frame¬ 
works on HPC infrastructure. 

Further, there is a need to map different aspects and stages 
of a data-intensive workflow to the appropriate infrastruc¬ 
ture. Choosing the right infrastructure for an application 
however, is a difficult task: data-intensive workflows typi¬ 
cally comprise of multiple stages with different compute and 
IO requirements. For example, while data filtering and pro¬ 
cessing is best done with Hadoop (using e. g. the MapReduce 















































abstraction), the compute-bound parts of that workflow are 
best supported by HPC environments. 

The Pilot-Abstraction and the Pilot-Data implementation 
enable applications to utilize and explore various paths of 
running Hadoop on HPC (vice versa) supporting end-to-end 
data workflows as well as specific steps, such as iterative ma¬ 
chine learning. In this paper, we demonstrated the usage of 
Pilot-Abstraction as a common, interoperable framework for 
ABDS and HPC, supporting a diverse set of workloads, e. g. 
both I/O bound data preparations tasks (using MapReduce) 
as well as memory-bound analytics tasks (such as KMeans). 
Using the Pilot-Abstraction, applications can combine HPC 
and ABDS frameworks either by running HPC applications 
inside YARN or by deploying the YARN resource manager 
on HPC resources using Pilot-Hadoop. Using these capa¬ 
bilities, applications can compose complex data workflows 
utilizing a diverse set of ABDS and HPC frameworks to en¬ 
able scalable data ingest, feature engineering & extractions 
and analytics stages. Each of these steps has its own I/O, 
memory and CPU characteristics. Providing both a unify¬ 
ing and powerful abstraction that enables all parts of such 
a data pipeline to co-exist is critical. 

Pilot-Data Memory provides a unified access to dis¬ 
tributed memory that is assigned to a Pilot-Compute and 
shared across a set of tasks. The framework using a plug¬ 
gable adaptor mechanism to support different in-memory 
backends, e. g. based on Redis and Spark. We demonstrated 
the effectiveness of the abstraction for iterative analytics ap¬ 
plications using KMeans as example. Pilot-Data Memory 
provides a significantly improved performance with speedups 
of up to 212x compared to the file-based Pilot-Data backend. 

In the future, we plan to extend the Pilot-Abstraction 
on heterogeneous infrastructures using an enhanced set of 
data-intensive applications. To support further use cases, 
we will evaluate support for further operations, e. g. to exe¬ 
cute collectives on data. Further, we will work on a higher- 
level API designed for data analytics applications and li¬ 
braries that supports the expression and execution of data 
pipelines. Also, infrastructures are evolving - container- 
based virtualization (based on Docker [59]) is increasingly 
used in cloud environments and also supported by YARN. 
Support for these emerging infrastructures is being added to 
the Pilot-Abstraction. 
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